Abstract

Treebank formats and associated software tools are proliferating rapidly, with little consideration for interoperability. We survey a wide 
variety of treebank structures and operations, and show how they can be mapped onto the annotation graph model, leading to an
integrated framework encompassing tree and non-tree annotations alike. This development opens up new possibilities for managing and 
exploiting multilayer annotations. 



1. Introduction 

There is a proliferation of treebanks, each with its own 
format and software tools. Examples include the Penn 
Treebank, the Prague Dependency Treebank, the Susanne 
Corpus, and treebanks of German, Spanish, Portuguese, 
French, Italian, Turkish, Polish, Bulgarian, Old English, 
and the recent development of Korean, Arabic and Chinese 
treebanks. Each treebank is associated with tools for anno- 
tation, search, and rendering. Despite the obvious benefits 
of interoperability, the tools associated with any given tree- 
bank rarely escape the confines of its own project. More- 
over, treebanks both require and invite multilayer capabili- 
ties. Parsers depend on tokenizers, taggers, and morpholog- 
ical analyzers. Layers of annotation such as sense tagging 
and named entity tagging are built on top of treebanks.
Disfluency tagging, as combined with treebanking in
Switchboard, adds another layer of indirection between parses and
the surface string. In short, necessity dictates the integra- 
tion of treebanks into a general multilayer annotation sys- 
tem, coupled with the development of a logical model and 
corresponding API which address the linguistic demands of 
treebanking. 

Linguistically, the development of such a framework 
leads to some interesting challenges. Grammars and the- 
ories of syntax yield structures which stretch the simplis- 
tic notion of trees over surface strings (such as empty con- 
stituents, encoding of deep syntactic structure, pure depen- 
dency structures, etc). As advances in information extrac- 
tion and language understanding bridge syntax and seman- 
tics, syntactic trees are growing various forms of seman- 
tic annotation. A case in point is the English Propbank, in 
which sentences are annotated with many fine grained se- 
mantic relations (or propositions) whose arguments in turn 
point to relevant syntactic substructures such as individual 
nodes or trace chains. The design and development of a 
system which aptly addresses these issues is certainly non- 
trivial. 

In this paper we examine conventional phrase structure
trees, dependency trees, and semantic trees. In each of these 
categories, we first survey the data formats and editing op- 
erations, outline an abstract API for the structural opera- 
tions involved, and describe an implementation with anno- 
tation graphs. 



A variety of treebank formats and models are covered 
by the survey. Sources of this variety are both linguis- 
tic and computational. On the linguistic side, languages 
may permit a greater or lesser degree of word-order free- 
dom. In some cases, the conventional tree representation 
requires crossing branches. This happens to a limited ex- 
tent in English, with phenomena such as adverbials and 
extraposition. However, it is pervasive in languages
having rich case-marking systems such as Czech. Treebanks
for these languages typically use a dependency represen- 
tation instead of the conventional tree representation. On 
the computational side, projects may have different prior 
commitments to file formats. The file format may simply 
be derived from the original Penn Treebank format, or be 
a novel plain text format, or be one of a variety of possi- 
ble XML representations. To a considerable extent, these 
formats are inter-translatable. Another source of variation 
is the kind of information which is annotated, and the sur- 
vey includes some recent work on semantic annotation and 
predicate-argument tagging. 

After reviewing a diverse set of treebank projects, we 
consider the kinds of tree-manipulation operations they re- 
quire, leading to an inventory of elementary tree opera- 
tions. These operations may be composed with each other 
to perform complex tree manipulations. Next, we show 
how the operations can be implemented in the annotation 
graph model (Bird and Liberman, 2001). This mapping
has an important consequence for multilayer annotations,
for now treebanks can co-exist with a variety of other an- 
notation types, such as prosodic and discourse level an- 
notations. With all the annotations expressed in the same 
data model, it becomes a straightforward matter to investi- 
gate the relationships between the various linguistic levels. 
Modeling the interaction between linguistic levels is a cen- 
tral concern both for the study of human communicative 
interaction, and for the construction of naturalistic spoken 
language dialogue systems. 

This inventory of elementary tree operations leads to 
a new application programming interface for treebanking, 
built on top of the existing annotation graph API which is 
implemented in the Annotation Graph Toolkit (Maeda et 
al., 2002). This implementation work is ongoing, and will 
be released under an open source license with AGTK. 



2. Conventional Syntax Trees 
2.1. Survey 

The Penn Treebank was the first syntactically anno- 
tated corpus, and consists of one million words of manu- 
ally parsed text from the Wall Street Journal (Marcus et al., 
1993). An example of the Treebank format is shown below. 

( (S (NP-SBJ-1
       (NP Yields)
       (PP on
           (NP money-market mutual funds)))
     (VP continued
         (S (NP-SBJ *-1)
            (VP to
                (VP slide)))
         (PP-LOC amid
                 (NP signs
                     (SBAR that
                           (S (NP-SBJ portfolio managers)
                              (VP expect
                                  (NP (NP further declines)
                                      (PP-LOC in
                                              (NP interest rates)))))))))
     .))

The empty constituents, called traces, represent vari- 
ous forms of syntactic movement that serve to normalize 
the underlying grammar. In this example there is a trace 
immediately preceding the infinitive to slide. This 
node is an empty constituent and refers to the phrase Yields
on money-market mutual funds, as is indicated by the fact
that both nodes share the -1 label. The movement of this
nominal phrase to the nominal position in the infinitival S 
clause normalizes this clause so that its constituents are NP 
followed by VP. In the Penn Treebank, traces are also used 
to indicate WH and other pronominal movement. Full de- 
tails can be found in the annotation guidelines. 
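
The bracketed format and its coindexed traces can be read mechanically. The following is a minimal sketch, not the tooling used by the Penn Treebank project: it assumes that the first token after each opening bracket is the node label, and the names parse_bracketed, leaves and labels are illustrative only.

```python
import re

def parse_bracketed(text):
    """Parse a labeled bracketing into nested [label, child, ...] lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        items = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return items

    return node()

def leaves(tree):
    """Terminals in left-to-right order; position 0 holds the label."""
    out = []
    for child in tree[1:]:
        out.extend(leaves(child) if isinstance(child, list) else [child])
    return out

def labels(tree):
    """All node labels, parents before children."""
    yield tree[0]
    for child in tree[1:]:
        if isinstance(child, list):
            yield from labels(child)

# A fragment of the example above (assumption: outer wrapper omitted).
SAMPLE = ("(S (NP-SBJ-1 (NP Yields) (PP on (NP money-market mutual funds)))"
          " (VP continued (S (NP-SBJ *-1) (VP to (VP slide)))))")

tree = parse_bracketed(SAMPLE)
words = leaves(tree)
# Labels ending in -N are antecedents; terminals of the form *-N are traces.
antecedents = {m.group(1): lab for lab in labels(tree)
               if (m := re.search(r"-(\d+)$", lab))}
traces = [w for w in words if re.fullmatch(r"\*-\d+", w)]
```

Here the trace terminal and the label NP-SBJ-1 share the index 1, which is how the empty constituent is linked to its antecedent phrase.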

The data in the Penn Treebank were created with an 
Emacs mode called parser-mode. The tool parses files of 
bracketed text in various stages of the corpus development,
starting with output from the automatic parser. Editing
operations add function tags, relabel, coindex, insert, and 
delete constituents, and relocate subtrees from constituent 
to constituent. Each of these operations is associated with 
a handful of constraints, such as preservation of the surface 
string. Notably, this constraint on the tree editing opera- 
tions leads to a set of tree editing functions for the user 
closed under the following structural manipulations: pro- 
motion of a leftmost or rightmost constituent, insertion of 
a constituent, insertion/deletion of empty constituents, and 
the movement of a constituent to its right (left, respectively) 
sibling's leftmost (rightmost, respectively) child position. 

In the rest of this section we consider various extensions 
to the Penn Treebank format. 

The Switchboard corpus of conversational speech (Godfrey
et al., 1992) was later enriched with information about
breath groups and disfluencies (Taylor, 1995). This new
information is simple enough on its own, e.g.: 

B.22: Yeah, / no one seems to be adopting it. / 

Metric system, [ no one's very, + F uh, no one wants ] 
it at all seems like. / 

However, the disfluency information was also superim- 
posed on the syntactic trees, resulting in extremely complex 
structures such as the following: 



( (S (NP-SBJ-1 no one)
     (VP seems
         (S (NP-SBJ *-1)
            (VP to (VP be (VP adopting (NP it)))))) .
     E_S) )

( (S (NP-TPC Metric system) ,
     (S-TPC-1 (EDITED (RM [)
                      (S (NP-SBJ no one)
                         (VP 's (ADJP-PRD-UNF very))) ,
                      (IP +) ) (INTJ uh) ,
              (NP-SBJ no one)
              (VP wants (RS ]) (NP it) (ADVP at all)))
     (NP-SBJ *)
     (VP seems (SBAR like (S *T*-1) ) ) . E_S) )

This format demonstrates the acute problem that arises 
when we attempt to force one linguistic structure into a for- 
mat that was designed for representing a completely differ- 
ent kind of structure. 

A more conservative extension of the Penn Treebank 
format is the UAM Spanish Treebank (Moreno et al., 2000).
In this format, the treebank node labels have a record struc- 
ture: 

(S
  (NP SUBJ ID-1 SG P3
    (ART "<El>" "el" DEF MASC SG)
    (N "<Gobierno>" "Gobierno" SG P3))
  (VP TENSED PRES IND SG P3
    (V "<quiere>" "querer" TENSED PRES IND SG P3)
    (CL INFINITIVE OBJ1
      (NP * SUBJ REF-1)
      (VP UNTENSED INFINITE
        (V "<subir>" "subir" UNTENSED INFINITE)
        (NP OBJ1
          (ART "<los>" "el" INDEF MASC PL)
          (N "<impuestos>" "impuesto" MASC PL))))))

Emacs is used for creating the structures, and a tree dis- 
play tool is used for verification. Various other tools check 
for well formedness (e.g. of the node attributes and gram- 
matical structures). 

Other treebanks use the same conventional nested struc- 
ture, but with a different syntax. For example, consider the 
following fragment from the Portuguese Treebank
[http://cgi.portugues.mct.pt/treebank/].

<s>
SOURCE: CETEMPúblico n=1 sec=clt sem=92b
C1-2 O 7 e Meio é um ex-libris da noite algarvia.
A1
STA:fcl
SUBJ:np
=>N:art(M S) O
=H:prop(M S) 7_e_Meio
P:v-fin(PR 3S IND) é
SC:np
=>N:art(<arti> M S) um
=H:n(M P) ex-libris
=N<:pp
==H:prp(<sam->) de
==P<:np
===>N:art(<-sam> F S) a
===H:n(F S) noite
===N<:adj(F S) algarvia
</s>

Emacs macros are used to edit the data, with operations 
for insertion and deletion of nodes as well as increasing 
and decreasing the depth of the nodes in the tree. Some 
tree structural constraints are enforced: whenever a node's 
depth is increased, so are all of its constituents, and all 
nodes must have a label. 

Finally, XML is now being used to represent treebanks. 
The simplest and most direct way to do this is to use
element nesting to represent hierarchy. An example of this
use of XML is provided by the French Treebank (Abeillé
et al., 2000), and we show a translation below
[http://treebank.linguist.jussieu.fr/].

<S>
  <NP>The proportion:NC
    <PP>of:P students:NC</PP>
  </NP>
  <PP>compared to:P
    <NP>the population:NC
      <PP>of:P
        <NP>our:D country:NC</NP>
      </PP>
    </NP>
  </PP>
  <PONCT>,:PONCT</PONCT>
  [rest of sentence elided]
</S>

It is notable that the part of speech labels are structured by 
convention in the embedded text rather than by using XML 
markup. 

2.2. API 

Many conventional tree operations, such as adding, 
moving or deleting a subtree, also modify the sequence of 
terminals (or leaves). In syntactic annotation, this sequence 
is usually fixed, since it is an external artefact which is not 
subject to editing. Therefore, we need to provide a com- 
plete inventory of tree operations which preserve the termi- 
nal string. 

Many treebanking projects incorporate a preprocessing 
phase, which may create some low-level constituents (such 
as noun phrase chunking) or may create an entire parse of 
the sentence. Therefore, the inventory of tree operations 
must be capable of reorganizing the structure of an existing 
tree, not just building a tree from scratch. 

In this section we define an inventory of elementary tree 
operations which preserve the terminal string and which is 
sufficiently expressive to permit any well-formed phrase- 
structure tree to be built over the terminal string, beginning 
either from an unparsed string or from a previously parsed 
string. The inventory is inspired by the various operations 
that are provided by existing tree annotation tools. We con- 
sider only those operations which modify the structure of 
a tree (as opposed to the operations for modifying node la- 
bels). 

Each operation requires a tree t along with a selected
node n. We write $t_n$ for the tree t oriented at node n.

move down $m_{\downarrow}(t_n)$ This creates a new node n' in the
position formerly occupied by n, and makes n the sole
child of n'. The new node n' is an unlabeled non-terminal
symbol. For example, under this operation,
the tree on the left becomes the tree on the right:

[Figure: example trees before and after move down, over the nodes B, C and D.]

move up $m_{\uparrow}(t_n)$ This applies only if n has no siblings,
deleting n', the parent of n. Node n now occupies the
former position of n'.






promote right $m_{\nearrow}(t_n)$ This applies only if n has at least
one sibling, but no siblings to its right. Node n is
moved up to the position immediately to the right of
its parent n'.

[Figure: example trees before and after promote right, over the nodes A, B and C.]

promote left $m_{\nwarrow}(t_n)$ mirror image operation of $m_{\nearrow}(t_n)$.

demote right $m_{\searrow}(t_n)$ This applies only if n has a sibling
n' to its right, and if n' is a non-terminal. Node n
becomes the leftmost child of n'.

[Figure: example trees before and after demote right, over the nodes A, B, C and D.]

demote left $m_{\swarrow}(t_n)$ mirror image operation of $m_{\searrow}(t_n)$.

All operations preserve the orientation of the tree; the
selected node remains selected after the operation. Observe
that all operations have inverses:
$m_{\uparrow}(m_{\downarrow}(t_n)) = m_{\swarrow}(m_{\nearrow}(t_n)) = t_n$,
$m_{\searrow}(m_{\nwarrow}(t_n)) = t_n$. All of these operations
preserve the order of the terminal string, and all are
elementary as none can be expressed as a combination of any
others.
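
The six elementary operations can be sketched over a tree with parent pointers. This is an illustrative sketch rather than the API of any existing toolkit: the class Node, the helper names, and the assumption that the selected node is never the root (and, for move up and the promotions, that its parent is not the root) are ours.

```python
class Node:
    def __init__(self, label, children=()):
        self.label, self.parent, self.children = label, None, []
        for child in children:
            self.add(child)

    def add(self, child, index=None):
        child.parent = self
        self.children.insert(len(self.children) if index is None else index, child)

    def detach(self):
        parent, index = self.parent, self.parent.children.index(self)
        parent.children.pop(index)
        self.parent = None
        return parent, index

def terminals(node):
    if not node.children:
        return [node.label]
    return [t for c in node.children for t in terminals(c)]

def show(node):
    if not node.children:
        return str(node.label)
    return "(%s %s)" % (node.label, " ".join(show(c) for c in node.children))

def move_down(n):
    parent, i = n.detach()
    parent.add(Node(None, [n]), i)        # new unlabeled node takes n's place
    return n

def move_up(n):
    old = n.parent
    assert len(old.children) == 1, "n must have no siblings"
    grand, i = old.detach()
    old.children.clear()
    grand.add(n, i)
    return n

def promote_right(n):
    old = n.parent
    assert len(old.children) > 1 and old.children[-1] is n
    n.detach()
    old.parent.add(n, old.parent.children.index(old) + 1)
    return n

def promote_left(n):
    old = n.parent
    assert len(old.children) > 1 and old.children[0] is n
    n.detach()
    old.parent.add(n, old.parent.children.index(old))
    return n

def demote_right(n):
    sibs = n.parent.children
    i = sibs.index(n)
    assert i + 1 < len(sibs) and sibs[i + 1].children, "needs a non-terminal right sibling"
    target = sibs[i + 1]
    n.detach()
    target.add(n, 0)                      # n becomes the leftmost child
    return n

def demote_left(n):
    sibs = n.parent.children
    i = sibs.index(n)
    assert i > 0 and sibs[i - 1].children, "needs a non-terminal left sibling"
    target = sibs[i - 1]
    n.detach()
    target.add(n)                         # n becomes the rightmost child
    return n

y, z = Node("y"), Node("z")
root = Node("A", [Node("B", [Node("x"), y]), Node("C", [z])])
before = show(root)
promote_right(y)
promoted = show(root)
demote_left(y)                            # undoes the promotion
```

In this sketch each operation returns the selected node, so operations can be chained, and the terminal sequence returned by terminals is the same before and after every call.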

More complex operations can be built from these elementary
operations. For instance, in a particular user interface,
it may be possible for a user to select a set of contiguous
terminals and non-terminals, and group them under
a new non-terminal:

[Figure: a tree A with children B, C and D (left) becomes a tree in which C and D are grouped under a new non-terminal (right).]

This can be done with a sequence of operations:
$m_{\downarrow}(t_C)$, $m_{\swarrow}(t_D)$. This is a generalized move down
operation, for which there is a corresponding generalized move
up.
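
The grouping operation can be sketched as a composition of move down and demote left. The sketch below assumes a simple nested-list representation, [label, child, ...], in which children are addressed by their position in the parent list; the function names are illustrative.

```python
def move_down(parent, i):
    """Wrap child i of parent in a new unlabeled node and return that node."""
    parent[i] = [None, parent[i]]
    return parent[i]

def demote_left(parent, i):
    """Make child i of parent the rightmost child of its left sibling."""
    left = parent[i - 1]
    assert i >= 2 and isinstance(left, list), "needs a non-terminal left sibling"
    left.append(parent.pop(i))

def group(parent, first, last):
    """Group children first..last (inclusive) under one new non-terminal."""
    move_down(parent, first)
    for _ in range(last - first):
        # After each demotion the next node to absorb sits at first + 1.
        demote_left(parent, first + 1)
    return parent[first]

tree = ["A", "B", "C", "D"]
new_node = group(tree, 2, 3)      # move down on C, then demote left on D
```

A longer selection only adds further demote left steps, so the generalized operation needs no new primitive.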

Note that there is another pair of elementary operations 
not discussed above, that could be called trace-insertion and 
trace-deletion. These involve the creation/deletion of a zero 
width element in the terminal sequence (or equivalently, of 
a "non-terminal" which dominates no terminal). 

2.3. Implementation 

Bird and Liberman have developed a model for expressing
the logical structure of linguistic annotations, and have
demonstrated that it can encode a great variety of existing
annotation types (Bird and Liberman, 2001). An annotation
graph is a directed acyclic graph where edges are labeled
with fielded records, and nodes are (optionally) labeled
with time offsets. The model is implemented in the
Annotation Graph Toolkit and used as the basis for several
annotation tools, including one for editing conventional
syntax trees (Maeda et al., 2002; Bird et al., 2002).

Annotation graphs can most easily be used to represent 
trees using the so-called "chart construction," in which each 
tree node is mapped to an annotation graph arc. An exam- 
ple tree and its corresponding annotation graph are shown 
below: 

    A          +------- A -------+
   / \         |                 |
  B   C        • --B--> • --C--> •

This approach has two shortcomings. First, in the situa- 
tion where a non-terminal has a single child, the annotation 
graph is ambiguous. Thus, the following two simple trees 
have the same annotation graph representation: 

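[Figure: two single-branch trees over the labels A and B, one with A dominating B and one with B dominating A.]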

The second shortcoming is that the annotation graph rep- 
resentation cannot express discontinuous constituency (i.e. 
trees that contain crossing lines). 

Both problems can be addressed by using equivalence
classes or cross references (Bird and Liberman, 2001). We
depict the relation between a child arc and its parent using
a dotted arrow, as shown below. While this is partly
redundant, it involves minimal overhead.

[Figure: an annotation graph in which dotted arrows link each child arc to its parent arc.]





The elementary tree operations that we discussed above 
can now be implemented directly in terms of the annotation 
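graph model. We begin with some definitions. Let x.start
(resp. x.end) be the start (resp. end) anchor of annotation x.
Let $\hat{x}$ be x's parent (undefined if x has no parent). Define
x's right sibling as follows:

$$\mathrm{rsib}(x) = \begin{cases} y & \text{if } y.\mathit{start} = x.\mathit{end} \text{ and } \hat{x} = \hat{y} \\ \text{undefined} & \text{otherwise} \end{cases}$$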



Annotation graph arcs are typed, and our implementa- 
tion requires two types, namely "word" for word arcs (the 
orthographic string), and "phrasal" for the phrasal arcs. 
Now we can define the above tree operations in terms of 
annotation graphs. 
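
The chart construction with parent pointers, and the right sibling definition, can be sketched as follows. This is not the Annotation Graph Toolkit API: the Arc record, the integer anchors, and the functions chart and right_sibling are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(eq=False)
class Arc:
    start: int
    end: int
    type: str                       # "word" or "phrasal"
    label: str
    parent: Optional["Arc"] = None  # the dotted arrow to the parent arc

def chart(tree, start=0, parent=None, arcs=None):
    """Map a tree to arcs. A tree is a word (str) or a tuple (label, child, ...)."""
    if arcs is None:
        arcs = []
    if isinstance(tree, str):
        arc = Arc(start, start + 1, "word", tree, parent)
        arcs.append(arc)
        return arc, arcs
    arc = Arc(start, start, "phrasal", tree[0], parent)
    arcs.append(arc)
    for child in tree[1:]:
        sub, _ = chart(child, arc.end, arc, arcs)
        arc.end = sub.end           # a phrasal arc spans all of its children
    return arc, arcs

def right_sibling(x, arcs):
    """The arc that starts where x ends and has the same parent, else None."""
    for y in arcs:
        if y is not x and y.start == x.end and y.parent is x.parent:
            return y
    return None

root, arcs = chart(("A", ("B", "x"), ("C", "y")))
by_label = {a.label: a for a in arcs}
```

In this example the arcs for B and x cover the same span; only the parent pointer records that B dominates x, which is the ambiguity the cross references are meant to resolve.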

move down Given the arc x, insert a new coterminous arc 
which becomes the parent of x. 



promote right Move a rightmost child to the right, out of 
the subtree; x's parent (y) becomes x's left sibling. 
Note that y must be a phrasal arc. 





demote right Move a subtree right, to become the leftmost
daughter; x's right sibling y becomes x's parent. Note
that y must be a phrasal arc.



Observe that none of these operations alter the content 
or arrangement of the word arcs. 
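
The three operations above can be sketched directly on arcs. The sketch is ours and not the AGTK implementation; it assumes integer anchors, no zero-width arcs, and that adjusting the start or end anchor of the affected phrasal arc is how the move is realized.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(eq=False)
class Arc:
    start: int
    end: int
    type: str                       # "word" or "phrasal"
    label: Optional[str] = None
    parent: Optional["Arc"] = None

def right_sibling(x, arcs):
    for y in arcs:
        if y is not x and y.start == x.end and y.parent is x.parent:
            return y
    return None

def move_down(x, arcs):
    """Insert a new coterminous phrasal arc as the parent of x."""
    new = Arc(x.start, x.end, "phrasal", None, x.parent)
    x.parent = new
    arcs.append(new)
    return new

def promote_right(x, arcs):
    """Move a rightmost child x out of its parent y; y becomes x's left sibling."""
    y = x.parent
    if y is None or y.parent is None or y.type != "phrasal":
        raise ValueError("x needs a phrasal parent that is not the root")
    if x.end != y.end or x.start == y.start:
        raise ValueError("x must be a rightmost child with at least one sibling")
    x.parent = y.parent
    y.end = x.start                 # y shrinks so that it no longer covers x

def demote_right(x, arcs):
    """Make x the leftmost daughter of its right sibling y."""
    y = right_sibling(x, arcs)
    if y is None or y.type != "phrasal":
        raise ValueError("x needs a phrasal right sibling")
    x.parent = y
    y.start = x.start               # y grows leftwards to cover x

s = Arc(0, 3, "phrasal", "S")
np = Arc(0, 2, "phrasal", "NP", s)
vp = Arc(2, 3, "phrasal", "VP", s)
w0 = Arc(0, 1, "word", "w0", np)
w1 = Arc(1, 2, "word", "w1", np)
w2 = Arc(2, 3, "word", "w2", vp)
arcs = [s, np, vp, w0, w1, w2]

promote_right(w1, arcs)             # w1 leaves NP
demote_right(w1, arcs)              # w1 enters VP
```

Only phrasal arcs and parent pointers are changed; the anchors of the word arcs are never touched.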

3. Dependency Treebanks 

Dependency grammar is an approach to syntactic repre- 
sentation in which words are organized into a hierarchy us- 
ing a binary "dependency" relation. Dependency trees pose 
a different set of challenges for representation and manipu- 
lation, as discussed in this section. 

3.1. Survey 

The Turin University Treebank (Bosco et al., 2000) provides
an example of a pure dependency structure, showing
a binary relation between the words. The treebank consists
of 500 sentences, available from
[http://www.di.unito.it/~tutreeb/]. A sample follows:

1 E' (ESSERE VERB MAIN IND PRES INTRANS 3 SING) [0;TOP-VERB]
2 Italiano (ITALIANO ADJ QUALIF M SING) [1;PREDCOMPL-SUBJ]
3 , (# PUNCT) [1;OPEN-PARENTHETICAL]
4 come (COME CONJ SUBORD MOD+TEMPO) [1;PREPMOD]
5 progetto (PROGETTO NOUN COMMON M SING) [4;PREPARG]
6 e (E CONJ COORD) [5;COORD]
7 realizzazione (REALIZZAZIONE NOUN COMMON F SING REALIZZARE TRANS) [6;COORD-2ND]
8 , (# PUNCT) [1;CLOSE-PARENTHETICAL]
9 il (IL ART DEF M SING) [1;SUBJ]
10 primo (PRIMO ADJ ORDIN M SING) [11;ADJCMOD-ORDIN]
11 porto (PORTO NOUN COMMON M SING) [9;NBAR]
12 turistico (TURISTICO ADJ QUALIF M SING) [11;ADJCMOD-QUALIF]
13 dell' (DI PREP MONO) [11;PREPMOD-LOC-SPEC]
13.1 dell' (LA ART DEF F SING) [13;PREPARG]
14 Albania (|Albania| NOUN PROPER) [13.1;NBAR]

This format consists of: the index of the word in the 
sentence; the word; parentheses containing the lemma and 
its morphosyntactic features; brackets containing a refer- 
ence to the parent of this dependent and the name of the 
grammatical relation. 
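
A line in this format can be split into its four parts with a regular expression. The following is a sketch under the assumption that every line has exactly the shape described above; the Token record and the parse_line function are illustrative names, and indices are kept as strings because of entries such as 13.1.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    index: str        # position in the sentence, e.g. "13.1"
    word: str
    lemma: str
    features: tuple   # morphosyntactic features following the lemma
    head: str         # index of the parent ("0" for the top of the sentence)
    relation: str     # name of the grammatical relation

LINE = re.compile(r"(\S+)\s+(\S+)\s+\((.*)\)\s*\[\s*([\d.]+)\s*;\s*([^\]\s]+)\s*\]")

def parse_line(line):
    m = LINE.fullmatch(line.strip())
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")
    index, word, inside, head, relation = m.groups()
    lemma, *features = inside.split()
    return Token(index, word, lemma, tuple(features), head, relation)

SAMPLE = [
    "1 E' (ESSERE VERB MAIN IND PRES INTRANS 3 SING) [0;TOP-VERB]",
    "5 progetto (PROGETTO NOUN COMMON M SING) [4;PREPARG]",
    "13.1 dell' (LA ART DEF F SING) [13;PREPARG]",
]
tokens = [parse_line(line) for line in SAMPLE]
heads = {t.index: t.head for t in tokens}   # child index -> parent index
```

The heads mapping is the binary dependency relation itself, so a dependency tree can be rebuilt from it without reference to the linear order of the lines.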

The Prague Dependency Treebank (PDT) (Hajicova,
2000) is a corpus with three distinct layers of annotation:
morphological, analytic (syntactic), and tectogrammatical.
We do not address the morphological annotation, in order
to focus on tree and treelike structures. Both analytic
and tectogrammatical structures are represented as
hybrid dependency trees, mixing a pure dependency relation
over the words with a minimum of constituents. This
representation is indicative of the underlying grammatical
theory, functional generative grammar. As the corpus
uses an extensive tagset and views annotations via a
special tool, we refer the reader to the project website for
data samples; PDT has an online tree viewer available (see
[http://shadow.ms.mff.cuni.cz/pdt/]).



The editor for the analytic level restricts the user to op- 
erations that maintain a well formed dependency tree with 
constituent nodes mixed in. In accordance with the rela- 
tively free word order in Czech, the tool allows movement 
of subtrees to arbitrary nodes, along with the creation and 
deletion of constituents. 

Further discussion of the tectogrammatical annotation
is deferred to section 4.

The TIGER Project uses a model intermediate between
conventional trees and dependency trees, represented in
XML (Mengel and Lezius, 2000). The dependency structure
is represented as a collection of nodes (n elements) and
words (w elements) connected using edges.[1] A simplified
version is shown below:

<n id="n1_500" cat="S">
  <edge href="#id(w1)"/>
  <edge href="#id(w2)"/>
</n>
<w id="w1" word="the"/>
<w id="w2" word="boy"/>

This format can represent arbitrary digraphs. The linear 
ordering of the children of any given node is represented 
by the file order of the corresponding elements (or by the 
internal structure of node identifiers). 

An important property of this format is its extensibility.
For instance, edges can be typed (with an attribute
type), and coreference is marked using edges having
type="semantic". Edges can also be labeled with the 
grammatical role of their dependent (e.g. label="HD" 
for the head daughter). 
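
A fragment in this style can be read with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree; the enclosing s element and the regular expression for the href values are assumptions added so that the simplified fragment forms a single well-formed document.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the fragment is wrapped in one root element named "s".
SAMPLE = """<s>
  <n id="n1_500" cat="S">
    <edge href="#id(w1)"/>
    <edge href="#id(w2)"/>
  </n>
  <w id="w1" word="the"/>
  <w id="w2" word="boy"/>
</s>"""

def target(href):
    """Extract the identifier from an href of the form #id(...)."""
    return re.fullmatch(r"#id\((.+)\)", href).group(1)

root = ET.fromstring(SAMPLE)
words = {w.get("id"): w.get("word") for w in root.iter("w")}
# Children are listed in file order, which encodes their linear order.
children = {n.get("id"): [target(e.get("href")) for e in n.findall("edge")]
            for n in root.iter("n")}
categories = {n.get("id"): n.get("cat") for n in root.iter("n")}
yield_of_s = [words[c] for c in children["n1_500"]]
```

Because edges are plain references between identifiers, nothing in this reading step assumes that the result is a tree, which matches the remark that the format can represent arbitrary digraphs.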

3.2. API 

An API for the structural editing of pure dependency 
trees is remarkably simple. We start with an arbitrary root 
node, and make all the words dependent upon this node. 
From this point, we can create any dependency relation by 
iterative application of a single move subtree operation, 
which takes a source node other than the root and a target 
node and makes the source dependent upon the target. Thus, after
an annotator identifies a single dependency, we may see a 
tree as follows. 



[Figure, Tree 1: Root at the top; W1, W2 and W4 on the level below it; W3 one level further down.]

Since the word order is free, it may be that W1 is
dependent on W4. To accommodate this, we can either let
the branches of the tree cross and retain the terminal order,
or we can rearrange the terminal order so that the branches
don't cross. After move subtree is applied to source W1
and target W4, we would obtain the following tree:



[Figure, Tree 2: the tree after W1 has been made dependent on W4.]




[1] A more abstract version of the same idea is described by Ide
and Romary (2000).



But some systems may use an underlying grammar 
which mixes pure dependency structure and a constituent 
based approach, as is found in the PDT. Such an approach 
allows the insertion of constituent nodes, equivalent to the
move down operation described for basic trees in §2.2:



[Figure, Tree 3: Root at the top; W1, W2, the constituent node C and W4 on the level below it; W3 one level further down.]



Such a constituent may then interact with the others just 
like the pure dependency nodes associated with a single 
word. For example, after two move subtree operations, 
we may end up with the following. 



[Figure, Tree 4: Root at the top; W2 below it; W3 and W4 on the next level; W1 on the lowest level.]



A user interface may facilitate a delete command which 
takes all the children of a proper constituent node and 
moves them to the parent of the deleted node, deleting the 
resulting empty constituent. 



[Figure, Tree 5: Root at the top; W2, W3 and W4 on the level below it; W1 one level further down.]



3.3. AG Implementation 

To implement editable dependency trees with annota- 
tion graphs, we begin by defining a root node as an arc 
which spans the length of the sentence. As with basic trees, 
each node in this tree has a parent pointer which by default 
points to the root. The primary editing operation is move 
subtree, which takes a tree and two distinguished nodes 
(w1 and w2), setting the parent of w1 to w2. This operation
is sufficiently expressive to define any structural editing op- 
eration on a pure dependency tree. 
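
The root arc, the default parent pointers, and move subtree can be sketched as follows. The Arc class and the function names are illustrative, and the cycle check is our addition: the text does not state how an attempt to move a node beneath its own dependent should be handled.

```python
class Arc:
    def __init__(self, start, end, label, parent=None):
        self.start, self.end = start, end
        self.label, self.parent = label, parent

def make_sentence(words):
    """A root arc spanning the sentence, with every word arc pointing to it."""
    root = Arc(0, len(words), "Root")
    arcs = [Arc(i, i + 1, w, root) for i, w in enumerate(words)]
    return root, arcs

def ancestors(arc):
    while arc.parent is not None:
        arc = arc.parent
        yield arc

def move_subtree(source, target):
    """Set the parent of source to target; anchors are left untouched."""
    if source.parent is None:
        raise ValueError("the root cannot be moved")
    if target is source or source in ancestors(target):
        raise ValueError("move would create a cycle")   # assumption, see above
    source.parent = target

root, (w1, w2, w3, w4) = make_sentence(["W1", "W2", "W3", "W4"])
move_subtree(w1, w4)        # W1 becomes dependent on W4; branches may now cross
```

Since only a pointer changes, the word arcs keep their original order even when the resulting dependency tree has crossing branches.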

Below we show a simple AG implementation of the 
editing operation move subtree with source w1 and target
w4.




For hybrid systems which allow constituents, we want 
to constrain the length of the constituent arcs as much as 
possible. In spite of the fact that setting the length of these 
arcs to a constant would reduce overhead, we take this ap- 
proach in anticipation that the quasi-ordering over annota- 
tions will provide a more substantial basis for layered an- 
notation than following pointers. 

We proceed by superimposing the implementation of 
move up and move down directly on top of this and ex- 
tend the definition of move subtree so that it works on 
arbitrary constituents and maintains a well formed hybrid 
structure. We have developed an algorithm for this which 
requires the ability to distinguish between words and proper 
constituents as well as between proper constituents and the 
root node. We accomplish this simply by checking the type 
of the arcs involved. We illustrate these extensions showing 
annotation graph representations of trees 3 and 4 below. 




• --w1--> • --w2--> • --w3--> • --w4--> •

Tree 3

• --w1--> • --w2--> • --w3--> • --w4--> •

Tree 4

4. Treebanks and Semantic Trees 
4.1. Survey 

While many semantic relations are described in tree- 
banks, predicate argument structure remains the most com- 
monly and systematically explored. Each treebank formu- 
lates some schema to represent the argument structure of 
clausal verbs, and indeed this information is to some ex- 
tent explicit in the parse itself. To complete the picture, 
the nodes of the parse tree are often decorated with labels 
denoting more abstract relations. In some cases, an entire 
extra level of annotation is supplied separately in a paral- 
lel corpus, as in the Prague Dependency Treebank (PDT). 
In this section we catalog a variety of predicate argument 
schemas, observing commonalities, and exploring require- 
ments inherent in capturing predicate argument structures 
with treebanks. 

The Susanne Corpus, developed as a by-product of a 
parsing schema for unambiguous syntactic annotation, pro- 
vides perspicuous coverage of predicate argument structure 
of clausal verbs. It decorates nodes with a variety of func- 
tion tags, though it restricts their usage to immediate con- 
stituents of clauses. 

[Nns:s John] expected [Nns:O999 Mary] [Ti:o [s999 GHOST]
to admit [Ni:o it]]

The example above is similar to the Penn Treebank 
example in that it requires coindexed nodes, but un- 
like the English Propbank, it does not use references to 
syntactic nodes. The complexity of predicate argument
well-formedness constraints together with a close
coupling of syntactic and argument relations are noteworthy
by-products of embedding these relations in the syntactic 
schema. 

We examine the tectogrammatical level of annotation in 
the PDT, as it represents a more abstract linguistic structure 
closely related to predicate argument structure. These trees 
are of the hybrid dependency variety described in §3. The
tectogrammatical dependency trees are roughly parallel to 
the analytic ones and their structure is derived by deleting 
and adding nodes to the analytic trees. Spurious elements of 
the surface string are removed and dropped arguments are 
added. While these operations produce the structure of the 
tree, edge labels such as actor, patient, addressee, location 
denote semantic roles and modifiers. 

The Penn Treebank uses attributes of phrase labels in 
conjunction with grammatical relations to describe pred- 
icate argument structure. In the example below, the last 
nominal phrase is decorated with a LGS tag denoting log- 
ical subject. The syntactic environment indicates the re- 
maining parts of the argument structure, with the head verb 
taking the role of the predicate and the preceding noun 
phrase taking on the role of direct object. 

(S
  (NP-SBJ (PRP they))
  (VP (VBP attribute)
    (NP (-NONE- *T*-1))
    (ADVP-MNR (RB directly))
    (PP-CLR (TO to)
      (NP
        (NP (NNS forces))
        (VP (VBN controlled)
          (NP (-NONE- *))
          (PP (IN by)
            (NP-LGS (NNP PLO) (NNP Chairman)
                    (NNP Yasser) (NNP Arafat))))))))

Algorithms for extracting predicate argument structure, 
even from such rich syntactic data, are faced with nu- 
merous complexities and ambiguities. For example, ghost 
constituents without explicit referents should be resolved, 
disjoint constituents may form arguments, prepositional 
phrases may or may not constitute arguments, and this 
information tends to be lexicalized over the predicates 
(Palmer and Rosenzweig, 2001). 

As a next step, the English Propbank is under develop- 
ment, using the predicate argument tagger mentioned above 
and hand-correcting the output. The example of this data 
below shows that the entire argument relation is
explicitly marked. Note that the argument label ARG1
implicitly refers to specific syntactic nodes rather than the surface
string, in this case resolving the passive trace. 

. . . they attribute directly to forces controlled 
by PLO Chairman Yasser Arafat . 

rel: controlled
ARG1: *trace* -> forces
ARG0-by: PLO Chairman Yasser Arafat

Additionally, the constituents of a particular argument 
may be disjoint as the utterance argument of a sentence like 

"I'm going home", John said, "so I can get some sleep". 

rel: said
ARG0: John
ARG1: [I'm going home] [so I can get some sleep]

Phrasal predicates, such as give up, are almost never 
dominated by a single node, and so are treated similarly. 



Another source of variation occurs with conjunctions 
over more than one argument. For example, the sentence 
below yields two propositions. 

John drove Mary to the store and Mike home 



rel: drove
Arg0: John
Arg1: Mary
Arg2-to: the store

rel: drove
Arg0: John
Arg1: Mike
Arg2: home



In the English Propbank, we witness argument structure
using references to syntactic annotation, a one to many
relation from arguments to constituents (also vice versa), 
and the marking of sentence-local equivalences to resolve 
grammatical motion. 

In conclusion, capturing predicate argument structure 
is of definite interest in the development of treebanks. In 
all the cases examined, an extra level of indirection from 
the syntactic structure is required. The English Propbank 
makes use of explicit references to syntactic constituents, 
the Susanne Corpus employs highly structured decoration 
of nodes, inducing relations between the nodes, and the 
PDT utilizes differences against the syntactic structure, re- 
placing analytic with semantic functions and recovering 
dropped arguments as necessary. 

4.2. API 

As predicate argument structure has quite varied treat- 
ment, we'll look at both argument structure as treated with 
the Penn Treebank and argument structure as in the Prague 
Dependency Treebank. However, we will restrict ourselves 
to working with predicate argument data as derived from 
syntactic data rather than as derived from scratch in order 
to best address the extant tagging efforts in this domain. 

In the case of the English Propbank, the operations are 
not editing operations on trees per se, but operations on 
relations between constituents in a given tree. For each
instance of a predicate in some parsed text, we can
characterize a proposition as a 4-tuple consisting of the predicate,
its arguments, its modifiers, and an equivalence
relation over the nodes in the parse tree. Each of the argu- 
ments or modifiers consists of a label and a non empty set 
of constituents, denoting its surface string content. While 
this set of constituents is often singleton, any non-singleton 
set of constituents represents a surface string which is not 
dominated by a single node (this occurs with phrasal verbs 
and often with the utterance argument in verbs of saying). 
The equivalence relation over the nodes of the parse serves
to recover dropped arguments (as occurs with empty con- 
stituents) and sentence-local antecedents of pronouns. The 
case of conjunctions whose conjuncts are not dominated by 
a single syntactic node is handled by associating multiple 
propositions with the instance of the predicate (or lemma) 
at hand. 

The editing operations for the annotation process consist
of associating argument labels (e.g. arg0 ... argN)
with constituents and identifying equivalent nodes of the
parse. For example, annotating the argument structure of
the predicate swim on the parse tree below (with nodes
identified in terms of their leftmost terminal number and
height) would yield a single proposition whose predicate is
$\{(3,0)\}$, whose arguments consist of $\{(\mathrm{Arg0}, \{(2,0)\})\}$,
whose modifiers are $\emptyset$, and whose equivalences are
$\{((2,0), (0,0))\}$.

S(0,1)
    NP-1(0,0)
        John
    VP(1,0)
        wants
        S(2,1)
            NP(2,0)
            VP(3,0)
                to

In the PDT tectogrammatical annotation, the operations 
are structurally similar to those of the analytic annotation, 
except that dropped arguments are added to the structure 
and words can be deleted. We defer addressing these issues 
for future work. 

4.3. Semantic Implementation 

We describe an implementation of Propbank annotation
with annotation graphs. Given an annotation graph parse of 
a basic tree as described in §2.3., we first define the pred- 
icating lemma over a set of constituents as an arc whose 
start point is the minimum of the start points of the associ- 
ated constituents and whose end point is the maximum of 
the end points of the associated constituents. For example, 
if the sentence is 

a1 John a2 belongs a3 to a4 the a5 club a6

where each a_n is an annotation graph anchor, and our predicating
lemma is belongs to, then the arc defining our predicate will
start at a2 and end at a4. Just as pointers were added for
basic tree constituents, we add to this arc a set of pointers to
the constituents containing belongs and to. This arc gets a
label indicating that it is the predicating lemma, say pred.
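The construction of the predicate arc can be sketched in Python as follows. This is only an illustration under stated assumptions: the Arc class and the spanning_arc function are hypothetical names, anchors are represented by their integer indices (so anchor 2 precedes belongs and anchor 4 follows to), and the constituent labels are placeholders rather than labels taken from a real parse.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Arc:
    """Hypothetical annotation graph arc between two anchors."""
    start: int                         # index of the start anchor
    end: int                           # index of the end anchor
    label: str
    pointers: Tuple["Arc", ...] = ()   # constituents this arc points to


def spanning_arc(label: str, constituents: Sequence[Arc]) -> Arc:
    """Build an arc from the minimum start point to the maximum end point
    of the given constituents, keeping pointers to those constituents."""
    if not constituents:
        raise ValueError("at least one constituent is required")
    return Arc(
        start=min(c.start for c in constituents),
        end=max(c.end for c in constituents),
        label=label,
        pointers=tuple(constituents),
    )


# Constituents containing "belongs" and "to" (labels are placeholders).
belongs = Arc(start=2, end=3, label="belongs")
to = Arc(start=3, end=4, label="to")

# The predicating lemma "belongs to" becomes one arc labelled pred.
pred = spanning_arc("pred", [belongs, to])
```

The same function would serve for arguments and modifiers by passing a different label, since they are denoted in the same way.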

The arguments and modifiers of the lemma are denoted 
similarly, with an appropriate label for the item in question. 
The end-product is diagrammed below: 




[Figure: annotation graph for the example sentence, showing the pred arc and the argument arcs.]



Finally, we specify the constituent equivalences by noting
all the non-singleton equivalence classes whose members
are among those associated with a label.
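One possible reading of this last step is sketched below in Python. The function names equivalence_classes and classes_to_record are hypothetical, and the selection rule (keep a class if it has more than one member and at least one member carries a label) is an assumption about the intended meaning, not something the text states outright.

```python
from typing import FrozenSet, Hashable, Iterable, List, Tuple


def equivalence_classes(
    pairs: Iterable[Tuple[Hashable, Hashable]]
) -> List[FrozenSet[Hashable]]:
    """Group nodes into equivalence classes from pairwise equivalences,
    using a simple union-find so that transitivity is respected."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    classes = {}
    for node in list(parent):
        classes.setdefault(find(node), set()).add(node)
    return [frozenset(members) for members in classes.values()]


def classes_to_record(pairs, labelled_nodes):
    """Keep non-singleton classes that contain at least one labelled node
    (assumed interpretation of the rule in the text)."""
    labelled = set(labelled_nodes)
    return [c for c in equivalence_classes(pairs)
            if len(c) > 1 and c & labelled]


# Swim example: the labelled node (2,0) is equivalent to (0,0).
recorded = classes_to_record([((2, 0), (0, 0))], labelled_nodes=[(2, 0)])
```

Under this reading, the equivalence between the empty subject and its antecedent is recorded because the empty subject carries the Arg0 label.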

5. Discussion and Further Work 

Treebank formats and associated software tools are pro- 
liferating rapidly, with little consideration for interoperabil- 
ity. We have surveyed a wide variety of treebank structures 
and operations, and shown how they can be mapped onto 
the annotation graph model. This has two important ramifi- 
cations, distinguishing our work from previous work. First, 
the false dichotomy between conventional trees and depen- 
dency trees goes away; both types along with hybrid struc- 
tures can be represented in a uniform framework. Second, 
a single comprehensive framework is used for both tree and 
non-tree annotations, an integration that greatly facilitates 
multilayer queries. 

Several aspects of the survey and the analysis are in- 
complete, and we list just three areas here. First, there 
is another class of treebanks used for grammar develop- 
ment, usually consisting of hand-crafted sentences illustrat- 
ing a particular linguistic phenomenon. Each sentence is 
associated with the correct analysis, expressed in a partic- 
ular syntactic formalism such as HPSG (Pollard and Sag, 
1994). An example of this kind of corpus is the HPSG 
Treebank for Polish (Marciniak et al., 2000). Represent- 
ing such treebanks using annotation graphs would require 
a more expressive model of arc labels than is currently per- 
mitted (namely attribute-value matrices). 

A second open question is in the area of bidirection- 
ality. Texts may involve a mixture of directionality, such 
as an Arabic text containing stretches of English. In such 
texts, there is no longer a transparent relationship between 
the sequence of orthographic words and their sequence in a 
spoken utterance; the linguistic representation needs to encompass
both orderings somehow, even though annotation
graphs force us to choose one of the orderings as primary. 

A third area for further investigation is query. Now that 
the annotations are all expressed in the same framework, 
how do we want to express queries over the annotations? A 
range of tree query languages have been proposed, as discussed
by Cassidy and Bird (2000). It is highly unlikely
that a single tree query language will ever meet the require- 
ments of all research projects. Instead, we plan to investi- 
gate a number of tree query languages and their mapping 
to a low-level annotation graph query language, such as the 
one proposed by Bird et al. (2000). 

In this article we have surveyed treebanks, examining 
their data formats and editing operations. We have found 
that the existing treebank models do not accommodate overlaid
annotation very well. We have developed abstract
APIs for treebanking operations which encompass the re- 
quirements of conventional trees, dependency trees, and 
even predicate argument structure. We have described how 
these APIs may be directly implemented using annotation 
graphs. This facilitates multi-layered annotations and lever- 
ages the array of annotation types that are already supported 
by the annotation graph model. 